Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

singleton: Allocate singleton instance on heap #79

Closed
wants to merge 13 commits into from

Conversation

res2k
Copy link
Contributor

@res2k res2k commented Nov 21, 2017

This avoids issues at process exit when the inner static singleton_wrapper
happens to be cleaned up before the singleton<> instance.
Makes Travis checks pass on Linux.

@masterleinad
Copy link
Contributor

This probably tries to address the same issue as #69. Do you have a (small) test case that fails without this patch?

@res2k
Copy link
Contributor Author

res2k commented Nov 22, 2017

Compare https://travis-ci.org/boostorg/serialization/builds/305474813 and https://travis-ci.org/boostorg/serialization/builds/301650246:
On develop the test_dll_exported test consistently crashes on Linux. I could also reproduce that locally (gcc 6.4.1 on Fedora 25).

@masterleinad
Copy link
Contributor

Yes, that looks convincing. 😃

@robertramey
Copy link
Member

This looks promising. It would explain why the problem occurs on some environments an not in others. One thing that has slowed us down is that we've been using test_dll_exported to test this because that is where the problem showed up. Looking through the test suite, I see there is no actual test for the singleton. Clearly this would be useful. So if anyone has or can make one, I'd be interested in seeing it and likely incorporating such a test in the set of tests. This would be particularly valuable as apparently some are using the singleton module outside the serialization library. And the singleton has special considerations regarding visibility, inclusion and the like.

@vasild
Copy link

vasild commented Nov 22, 2017

I confirm that the patch in this pull request fixes the crash I describe in #69:

$ cat bug.cc 
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <sstream>
#include <vector>

void f()
{
  std::stringstream iss;
  boost::archive::binary_iarchive ar(iss);
  std::vector<int> out;
  ar >> out;
}

int main(int argc, char* argv[])
{
  return 0;
}

Without the patch:

$ ./bug
Bus error (core dumped)

With the patch:

$ ./bug
$ valgrind ./bug
==34751== Memcheck, a memory error detector
==34751== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==34751== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==34751== Command: ./bug
==34751== 
==34751== 
==34751== HEAP SUMMARY:
==34751==     in use at exit: 1,544 bytes in 16 blocks
==34751==   total heap usage: 16 allocs, 0 frees, 1,544 bytes allocated
==34751== 
==34751== LEAK SUMMARY:
==34751==    definitely lost: 0 bytes in 0 blocks
==34751==    indirectly lost: 0 bytes in 0 blocks
==34751==      possibly lost: 0 bytes in 0 blocks
==34751==    still reachable: 1,544 bytes in 16 blocks
==34751==         suppressed: 0 bytes in 0 blocks
==34751== Rerun with --leak-check=full to see details of leaked memory
==34751== 
==34751== For counts of detected and suppressed errors, rerun with: -v
==34751== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
$

Cheerz!

@@ -121,7 +126,8 @@ class singleton : public singleton_module
// our usage/implementation of "locking" and introduce uncertainty into
// the sequence of object initializaition.
use(& m_instance);
return static_cast<T &>(t);
if (!t) t = new singleton_wrapper;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construct does not look thread safe, whereas the original was (at least since C++11). Could this be an issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just doing static singleton_wrapper *t = new singleton_wrapper; above should be fine, don't you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm - as far as I'm concerned the original is/was correct. But it seems that with some combinations of compiler/library ...? the order of destruction of the static variable is not the reverse of the construction - as they should be. There is also the question of why we get any kind of memory issue when the program ends. A big motivation for using a static variable was that it would be built into the program image and never touch the heap. But apparently that was wishful thinking. It would qualify this as a work around for unexpected/obscure/incorrect runtime behavior regarding invocation of destructor of static variables. Of course I don't really know. It could be a bug / undefined behavior which no one has seen yet. Given all this uncertainty, I'm putting in the change given that it addresses the failure of one of the tests.

On the other hand, I've since changed the test to make it more correct. It's possible that it was an error in the test that no one sees and that now it's gone. If more information comes to light we can easily change it again. This is not a persistent object.

So, for now, we'll just keep on trucking ...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masterleinad Yes I think that would fix the concurrency issue I had in mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the heap allocation:
This could be avoided by declaring a byte array of singleton_wrapper size (and alignment) and using placement new.

But you'd need boost::call_once to ensure thread safety.

While I noticed the problematic order of destruction I have not investigated the reasons in detail, so I unfortunately can't say what exactly goes wrong. But maybe I get around to getting a better explanation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that doing the assignment right in the declaration of the static local variable solves the problem. If you don't want to do this, then you need to do double-checking to make things thread-safe, as shown here: https://en.wikipedia.org/wiki/Double-checked_locking

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if any of you guys want to make a name for yourself, you could factor out the singleton into it's own boost library/component and the serialization library would just import it. When I made it it seemed simple - no heap allocation, initialization before main is called. Destruction after main is exited etc. A no brainer. But it might be that the compiler handling of static variables throws a monkey wrench into all this.

@masterleinad
Copy link
Contributor

It seems you also need to rebase.

@robertramey
Copy link
Member

Hmmm - I'm not convinced here.

One of the main problems that this singleton attempts to address is the the notorious "static variable initialization sequence" issue. The original scheme guaranteed that that items would be be initialized only when first used. This change would break that I believe.

The only reason we're using a raw pointer anyway is to work around an issue which presents itself only in some older compilers. Given the complexity of dealing with this, I'd really prefer that someone - maybe you - make a boost singleton library - maybe based on this so that the serialization library could just include it.

@bangerth
Copy link

That seems like an undue burden on @masterleinad that might lead to the problem never getting fixed :-( I would hope that it's possible to commit a fix without creating a new boost library.

I understand being cautious around these issues. But it's not like everything is alright right now. The situation is already broken, and we know it used to work before.

@robertramey
Copy link
Member

robertramey commented Nov 30, 2017

Hmmm -I believe it was always broken before - it just showed up on one new test test_dll_exported and only on some combinations of compiler/library/OS. As far as I know, there are no current bugs in the serialization library.

A change like this has the potential to cause all sorts of havoc. Has anyone actually ran all the library tests with this change?

@davydden
Copy link

davydden commented Nov 30, 2017

@robertramey

Hmmm -I believe it was always broken before

AFAIK it was working before, specifically 1.63 is ok, 1.65.0 is ok, but not 1.65.1, see this issue dealii/dealii#5262

@juanchopanza
Copy link

Even though the issue is affecting me, I would also urge caution here. Furthermore, it isn't clear to me that the problem is fully understood. It would really help confidence if someone who understands it could provide a more detailed analysis. @robertramey BTW I am also curious as to the reasons for the current complex structure with a function-local static and a static data member. Is this the right forum to ask about that?

@vasild
Copy link

vasild commented Nov 30, 2017

Hmmm -I believe it was always broken before

No, from the duplicate report at #69:

I confirmed that the bug was introduced in b0a794d. There is no crash with b0a794d~ and a crash with b0a794d (which is from Feb 2017).

@robertramey
Copy link
Member

I know the bug was detected in b0a794d But that does not convince me that that it was introduced in that version. In other words, I think the bug is somewhere in the the standard library/compiler implementation and that the change in b0a794d only reveals what was only there. Our "fix" is not really a fix but rather a work around some phenomenon that we don't understand.

@robertramey
Copy link
Member

juanchopanza - the "complex" structure is the triumph of implementation over design. That is, it wasn't actually "designed". I started with something simpler but had to make it a little more complex in order to make it work. A simpler design fails to address a number of issues:

a) sequence of nested static objects. static object a refers to static object b etc. This means that b has to be created before a. But this is under control of the compiler/library before "main" is called - or not.

b) Compilers will optimize code by elimination of code not explicitly referred to. The serialization library creates static objects to contain functions for de-serialization of exported derived pointers. These pointers are accessed through the vtable of the abstract base class so the compiler never sees any reference to them. So some method has to be sure that the compiler doesn't throw them away. In some cases this problem will manifest itself as a failure to link an in others as a runtime error. Note that behavior usually varies between debug and release modes, between compilers and compiler versions and whether the library is deployed as shared library/DLL or static library. Also different tests show different problems. It's not trivial to make this work across all these combinations. If fact the number of combinations is so large, that it's not really possible to test them all. And since we're relying on undefined behavior here, testing is actually the only way to make a guarantee that the library works.

c) The above is a result of the design decision to permit deep serialization to objects through pointers. This is a key feature of the Boost Serialization library. There are libraries which don't do this - e.g. Cereal which as a consequence are much, much simpler. In many cases, they might be a better choice. But they are not comprehensive. The blessing/curse of the Boost serialization library is that one can use it serialize virtually any C++ data structure without having to tamper with the structure design itself. That is, one can more or less completely de-couple the serialization from the rest of one's program. This is very convenient for the library user, but comes at a heavy cost of complexity for the library developer/maintainer

@bangerth
Copy link

bangerth commented Dec 1, 2017

@robertramey: Pragmatically, do you have evidence that the current code is "more correct" than it was before? I do very much sympathize with your desire to come to a well-understood, well-designed code structure. But I'm not sure anyone has the gumption to actually make it happen, and it would be a shame if the perfect stood in the way of improvement.

We know that the code used to work and broke with a particular commit (works just before that patch, fails just after it). Furthermore, this commit can be reverted without too much trouble IIRC. I'm not the expert in this area, but the commit message that went along with it ("Trying to get minGW to function for serialization library") at least doesn't explain what was wrong before and why the change made is correct. Do you recall why this particular change was made?

@robertramey
Copy link
Member

The change was made to address failure of all tests with mingw. These tests are all passing now. I don't recall for a fact whether it was this change or some subsequent one which made the mingw tests pass again.

@bangerth
Copy link

bangerth commented Dec 2, 2017

So, moving forward, where do we find someone with a MingW system who could test the proposed patch to see what happens?

@robertramey
Copy link
Member

LOL - actually, since I made this change I installed a mingw system just to be able to test this. I don't remember whether I tested this specific change on it. FYI I use a Mac with OSX. I bought Parallels, installed Windows, ubuntu linux and Mingw. It was basically the only way I could track down these kind of problems. Short version - works great for these issues. So I could investigate this some more. But it still takes a lot of time because it likely ends in a game of whack-a-mole. Fix it for mingw and it breaks mdvc, fix that and it breaks gcc, etc.

@davydden
Copy link

davydden commented Dec 2, 2017

meanwhile, @res2k could you please rebase to fix the conflict?

@bangerth
Copy link

@robertramey -- since you seem to be the only one with a system on which this can be tested, would you mind testing the patch there?

I don't know what else to suggest: I'm not a BOOST maintainer, so the decision is not mine, but if this was in the projects I work on, I'd choose reverting a small part of a patch that we know broke something, over waiting for the ultimate patch that is, in all likelihood never going to arrive :-(

Short of that, do you have any suggestions how we can proceed with this?

@masterleinad
Copy link
Contributor

Just for clarification: The first commit of this PR was merged meanwhile (7d216b4) leaving the fix of the possible thread-safety issue in this PR.

@bangerth
Copy link

Ah, thanks for the clarification and for rebasing what's left. The remainder appears to be a pretty straightforward change.

@jdmairs
Copy link

jdmairs commented Dec 21, 2017

I too ran into a problem with 1.65.1 where my historic raw archives using boost serialization failed to import. I'm using msvc VS2017 15.3 and 15.4.2. Is there any easy way I can grab a whole file replacement to patch this problem? I would drop back to boost 1.64 but I've already updated my msvc compiler to 15.3 and 1.64 does not recognize the compiler. And to make things worse we use offline installers and no one has VS2017 15.2 anymore!

I also just tried boost 1.66 and my problem still remains.

@res2k res2k force-pushed the singleton-heap-alloc branch from a2fd454 to 4f11789 Compare February 7, 2018 14:35
@res2k
Copy link
Contributor Author

res2k commented Feb 7, 2018

I took another look at the issue.

For one, I tried to understand the reason behind the destruction order issue. As a local static variable is initialized lazily, or possibly not at all, it's cleanup function will be registered dynamically as well (See eg https://godbolt.org/g/YkApUv - note the _cxa_atexit call). That dynamic registration appears to make some mistake - the cleanup function is associated with a shared library that's cleaned up earlier than the module the static variable actually "lives" in is cleaned up - with code executed later assuming the variable is still valid.
That scenario can be worked around by using a global static var to handle the cleanup.

However, other ordering issues appeared with MSVC static builds. There the "solution" is to actually use atexit()

So, on the one hand, that all appears to work (at least the automated tests look promising, see here: https://travis-ci.org/res2k/serialization/builds/338525872 and here: https://ci.appveyor.com/project/res2k/serialization/build/1.0.31-develop). On the other, distinguishing the cases makes the implementation a bit messy. I'll probably try if I can find a cleaner yet one-size-fits-all implementation.

@Flamefire
Copy link

You are still making the mistake I described in #104: You have ctor/dtors for singleton, expected them to do something but never instantiate it. Hence the destroyed flag is never set! This causes crashes for certain destruction orders (e.g. extended_type_info_typeid singleton destructed after ktmap singleton (here:

BOOST_ASSERT(! singleton<tkmap>::is_destroyed());
)

@Flamefire
Copy link

@robertramey

I know the bug was detected in b0a794d But that does not convince me that that it was introduced in that version. In other words, I think the bug is somewhere in the the standard library/compiler implementation and that the change in b0a794d only reveals what was only there. Our "fix" is not really a fix but rather a work around some phenomenon that we don't understand.

The bug was introduced in that revision. See #104 for proof.

The original scheme guaranteed that that items would be be initialized only when first used. This change would break that I believe.

This is (currently and for a while) not true. The scheme used makes sure all singletons are created BEFORE main is entered (so one can use lock although I question its use)

@res2k
Copy link
Contributor Author

res2k commented Apr 25, 2018

Actually, is the singleton<> ctor even called?
Anyway, I'd try dealing with get_is_destroyed() in cleanup_func() as well.

@Flamefire
Copy link

The singleton is never created, so no. That is my point

@res2k
Copy link
Contributor Author

res2k commented Apr 26, 2018

I did a quick test on the branch for this PR: I put assert(false) in singleton ctor resp. dtor, on gcc/Linux; running the unit tests, these asserts generally fired.
(Please take note this branch is not merged into mainline serialization, so issues there may not apply here, and vice versa.)

@Flamefire
Copy link

Then where is it constructed? I'll have another look too and contribute a test for the singleton itself

@Flamefire
Copy link

Ok found it. There is some trickery done there. Some classes are derived from singleton<T>. So singleton<T> will instantiate singleton_wrapper<T> which is derived from T and hence from singleton<T> (ouch, my brain!)

I think this is intended and should be the way singleton<T> has to be used as one can then simply use T::get_const_instance(). However for some classes this is not possible (hello std::map) which leads to problems as singleton<T>::get_const_instance() was not meant to be used. Maybe this should be enforced?

@robertramey
Copy link
Member

OK - I'm going to take another look at this. I wasn't happy with the original "fix", but things need to be fixed. So it looks like we'll have to fix this "fix". I hate this stuff. If some want's to submit a test which detects a memory leak, that might be useful to cut the "whack a mole" cycle and keep us moving forward.

@alfC
Copy link

alfC commented Apr 26, 2018 via email

@Flamefire
Copy link

Every type used in serialization registers itself with the serialization library. This is actually 2 singletons: One for the registered type and one for the register itself. During destruction of the registered type it unregisters itself.

The problem was, that the register was destructed before the registered type and so unregistering lead to a crash due to the destruction check not working. This was partially fixed by using a heap allocated singleton instance which "fixed" it by leaking the register singleton (and possibly some more)

And no the user can't do much about this. Use Boost >= 1.66 and don't use serialization in a shared library that is loaded/unloaded multiple times as this will build up wasted memory.

@alfC
Copy link

alfC commented Apr 26, 2018 via email

@Flamefire
Copy link

Imo it cannot be avoided. The singleton gets instantiated for every type used (I think)

No other solution possible than fixing it.

I'm working on it too (look at my fork. I got test cases atm and will fix them in the next days)

@Flamefire
Copy link

@Flamefire
Copy link

@alfC I added a PR which you can use to batch your version of boost to avoid the problem in the meantime. If you encounter any problems, tell me in that PR: #105

@robertramey
Copy link
Member

I believe this is now addressed. If anyone is unhappy with the resolution, open a new PR

@bangerth
Copy link

@robertramey: For completeness, could you say which pull request/commit fixed the issue? This way it's easier for others later on to follow the conclusion of the discussion.

@robertramey
Copy link
Member

sorry I can't do this. the initial fix was just reversing the latest commit to fix another problem. Once the this was done, the original problem reared it's ugly head generating a months long saga which only recently has started getting sorted out. You can reconstruct all this yourself if it's important to you, but I personally can't spare the time to do this.

@Flamefire
Copy link

@bangerth The commit was f297d80 which is mostly my PR #131
This might not be 100% correct, but answers your question

tamiko added a commit to tamiko/dealii that referenced this pull request May 25, 2022
This has been finally fixed upstream sometime in late 2018 [1], and this
runtime check is very expensive. Let us skip it for a safe, good
version. I have chosen 1.74 semi randomly because it has been released
in 2020 and is the boost version found in Debian 11.

[1] boostorg/serialization#79 (comment)
mkghadban pushed a commit to OpenFCST/dealii that referenced this pull request Sep 8, 2022
This has been finally fixed upstream sometime in late 2018 [1], and this
runtime check is very expensive. Let us skip it for a safe, good
version. I have chosen 1.74 semi randomly because it has been released
in 2020 and is the boost version found in Debian 11.

[1] boostorg/serialization#79 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.